1 Introduction

In recent years, in order to meet the needs of sport coaches or media industry, the application of statistical analysis in the field of sports has developed rapidly. The ability to accurately predict the outcome of sporting matches is gaining popularity. For example, sport matches forecasting could allow media channels to provide more in-depth coverage of sporting events or allow tournament organizers to better match players’ skills to provide closer, more exciting matches. Tennis as one of the appealing sports,is well-suited for forecasting research because historical match results are publicly available for most pro matches so it is chosen in this project. The purpose of this project is to develop a method for predicting match outcomes and accounting for player-to-player matchup effects on men’s and women’s professional tennis matches. This project examines several probabilistic models for forecasting the outcomes of professional tennis levels, such as ATP or WTA matches. These models do not only focus on the prediction results in the tennis matched, but also consider the effect of player head-to-heads. In addition, the effects of surface type about the variation of players’ ability would be addressed in the models as one of the exploratory variables and random effect variables. The models are trained and evaluated on the two historical tennis matches results of approximately 4000 tennis matches between 2010 and 2020 after filtering top 100 players.

2 The motivation of this project

The main objective of the project is to develop and evaluate different approaches for predicting the tennis match results. In addition, by considering the head to head effects which might exist in the matches, this project would investigate strategies for estimating and incorporating these matchup effects into a prediction model. Furthermore, the ability of players in the different surface court would be explored in the modeling process as well.

3 Data description

There are two data sets atp_results and wta_resultsused on this analysis report, and the two data sets consist of approximately 36,000 matches information from 2011-01-01 to 2020-01-23 in ATP tour and WTA tour. Both of them are obtained from GitHub skoval, which have similar data structures.

The table 3.1 summarizes the key information which is contained about each match in the data. The data set also contain further information such as player id, date of tennis tournaments, Name of the match, the around in a tennis match and ect. The full list is omitted as the prediction model is more concerned about the forecasting result of tennis matches and none of this additional information is used by the models in this project. So, the only four key variable were mainly used for this project.

Theoretically, the full data sets should be used for data modeling, but the ranking data about top 100 players in ATP or WTA will be introduced in the following section and combined them with the full data to construct the new model data which is used for the data modeling in this project report due to the system operation problem. Also, one of the reasons for these top 100 player sample is to develop a general framework more quickly by focusing on a relevant but smaller-scale problem.

Table 3.1: Key match informatiom
Variable Description
Surface Match Surface
Player1 Match Winner
Player2 Match Loser
Tier Professional tennis tour tournament

4 Head to Head effect Analysis

Head to head means the direct competition between two players. In this project, head-to-head effect indicates that the matchup effect of two specific players meeting in the matches. This is not only to consider the player’s ability in tennis skills such as the talented-levels, but more importantly to detect whether there is a certain effect between the two players such as one player might have over a specific opponent that goes beyond what their ability would predict, factors like playing style, familiar with their opponents even the psychological impact.

On the other hand, small sample size is typical of head-to-head effects in professional tennis. From the figure 4.1 shows the players who have met each other more than 10 times in the past decade. Only a few high ranked players met each other frequently and the rest only meet a few times. In the ATP data, Djokovic met Nadal 29 times, as the current number one and third male player in the world rankings. It is clear that it has fewer players who have met each other more than 10 times in the WTA data. The players who faced each other most often are Serena and Azarenka,which have met 15 times.

The players who have met each other more than 10 times

Figure 4.1: The players who have met each other more than 10 times

4.1 Evaluating the head to head effect in each tier

Additionally, tennis sports is like a pyramid, with the size of the competitive pool getting lower when the players go upward the tournament tier as the stronger players will stay until the end. In this case,if there are many cases of player-opponent matchups with a single match, this will affect the accuracy of estimations of the head-to-head effects and bring all the effects closer to zero that would be the case with a less sparse sample.

Therefore, from the figure 4.2,it displays the number of matched with different times of meeting in each tier in the ATP data. Clearly, having played more than 3 matches against an opponent is an unusual long history compared to the normal matches. In the ATP data, there are four tier groups and ATP WorldTour 250 has most of matches records which is the lowest tier of annual men’s tennis tournaments on the main ATP Tour. In the ATP Matser1000,there are 753 matches among the players who have met more than three time each other, and only 615 matches have taken place in the ATP WorlTour250. It is obviously that the number of matches against the same opponent will be less and less in each tier especially in ATP WorldTour 500.

A similar situation occurs in WTA data. Based on the plot 4.3 shows that less than chance of playing more than 2 or 3 matches against the same opponent happened in all tiers. It is worth noting that WTA Tour will rename its tournaments in 2021 which will simplify the understanding of the structure of the tour. Therefore, the premier category becomes WTA 1000s and 500s and international category become WTA 250s, much like the ATP tour tournaments. In addition, 125K serier is like the challenger level in ATP, which is the entry-level for tour tournaments. But from the table data, the number of matches occurred in the 125K Series is the lowest, suggesting that players at the 125K Series will not often build up substantial head-to-heads against other players on the tour. According to the data from these two tables, it indicates that head-to-head matches are infrequent and that the number of matches in each tier that involve player-opponent matchup effects in the long history is very small.

The number of matches for players met in different number of times in ATP

Figure 4.2: The number of matches for players met in different number of times in ATP

The number of matches for players met in different number of times in WTA

Figure 4.3: The number of matches for players met in different number of times in WTA

4.2 Evaluating the H2H effect in each surface

Similarly, the figure 4.4 and 4.5 represent the number of tennis matches occurred in the different surface type in the last decay. Obviously, the majority of matches, and even matchups more than three times, take place on hard courts, where movement tends to be medium to fast because the court absorbs little energy and players are able to use a variety of spins during the match. In addition, Clay court is the second highest frequency playing surface and it examines the return ability of players as the balls bounce relatively high and lose a lot of initial velocity on contact with the water. Moreover, the number of matches for players has met more than three times are small in these two tables, suggesting that the number of matches in head to head effect is relatively small even in ten year record.

The number of matches for players met in different number of times in each surface in ATP

Figure 4.4: The number of matches for players met in different number of times in each surface in ATP

The number of matches for players met in different number of times in each surface in WTA

Figure 4.5: The number of matches for players met in different number of times in each surface in WTA

In the following section, a variety of prediction methods will be introduced formally, and evaluated the estimates for parameters of the model and model performance.

5 Modeling section

The main approaches to tennis match prediction in this project fall into two categories: simple pair comparison models and Bayesian Paired Comparison models in Stan. This section provides part of the technical description for models trained during this project.

5.1 Pairwise Comparison Model

Pairwise comparison is the process of comparing entities in pairs to determine which one has a better chance of winning. One of the famous and more effective method for establishing pairwise comparison is Bradley-Terry Model(Bradley and Terry 1952), applying in the contest between two players in tennis(McHale and Morton 2011).

Let \(y_{ij}\) be the outcome of a tennis match that is played between player i and player j at time t. We assume that we have information about K different players over a time period of length n (i.e.: i, j =1…. ,K). The outcome \({y_{ij} =1}\) if player i wins the match at time t whereas the outcome \({y_{ij} =0}\) if player j wins the match at time t. Therefore, the outcomes about \(y_{ij}\) are binary response value.

The conditional probability that \({y_{ij} =1}\) is given by: \[P_{ij}=P(y_{ij}=1|\delta_{ij})=\frac{exp(\delta_{ij})}{1+exp(\delta_{ij})};\] \[\delta_{ij}=\lambda_{i}-\lambda_{j}\] where\(\lambda_{i}\) represents the skill level of player i and \(\lambda_{j}\) represents the skill level of player j. The conditional probability of \({y_{ij} =0}\) is instead equal to \(1-P_{ij}\).

On the other hand, this model can be alternatively expressed in the following logit-linear way, which can be reduce the model to logistics regression on pairs of individuals: \[\mathrm{P}(\left[ i \mbox{ beats } j \right]=logit(P(i>j))=log(\frac{P(i>j)}{1-P(i>j)})=\lambda_{i}-\lambda_{j}\] where \(\lambda_{i}\)=\(log(\alpha_{i})\) for all i and \(\alpha_i;\alpha_j\) are the positive-valued parameters,respectively skill parameter of the players. Optimization can be used to jointly solve for skills parameters of all players based on a fixed period of historical results. Match winning probabilities can then be derived based on the same i.i.d(independent identifying distributed) assumption used in point models(Klaassen and Magnus 2001). Therefore, based on the IID assumption for all contests,the parameters \(\lambda_{i}\) can be estimated by maximum likelihood using BradleyTerry2 package for generalized linear models(Turner, Firth, and others 2012).

The BradleyTerry2 package provides a more flexible user interface to allow a wider range of models to be fitted in the data and allows the inclusion of simple random effects, where the \(\lambda_{i}\) can be related to explanatory variable via a linear predictor of the form: \[\lambda_{i}=\sum_{r=1}^{p}{\beta_{r}x_{ir}+U_{i}}\]

The inclusion of the prediction error \(U_i\) allows for variability between players with equal covariate values and induces correlation between comparisons with a common player.

5.2 Bayesian Paired Comparison models in Stan

1. The Bayesian method applied in Bradley-Terry model:

Bayesian inferences are based on a posterior probability density distribution (further referred to as the posterior) regarding the parameters, given the data collected. This allows for a more intuitive interpretation of parameters uncertainty. Basically, the Bayesian model can be broken down as consisting of:

  • Observed variables: The tennis matches outcomes are given. A single outcome will be given the notation \(r\) and a set of outcomes \(D\).

  • Unobserved variables: The parameters are chosen as part of model design,representing generally by the notation \(w\).

  • prior probability: \(P(w)\). A standard normal prior is used in this project.

The future outcomes can be predicted based on the relationship \(P(r|w)\) if given the parameters of the model. However,since the parameters are unknown, the future value become inferred based on the historical data. For some data D, on a set of n outcomes, the posterior probability of the given model parameters can be expressed as: \[P(w|D)=\frac{P(D|w)P(w)}{P(D)} \propto P(D|w)P(w)\] where \(P(D|w)\) is the likelihood of the model given the data and \(P(D)\) is marginal likelihood of the model found through normalisation. Under the assumption that outcomes are independent identified distributed, the likelihood can be expressed as: \[P(D|w)=\prod_{i = 1}^{N} P(r^i|w)\]

Moreover, for the prediction with new match \(r^{n+1}\),it can be made by taking the expectation of \(P(r^{n+1}|w)\) with respect to the posterior distribution \(P(w|D)\). However, for the models in this project, determining this expectation and the normalisation constant of the posterior distribution \(P(w|D)\) requires doing integrals which are intractable. One of the common method to achieve the predicted value is used an approximate inference technique like Bayesian approach to approximate the full posterior distribution.

Applying Bayesian method in Bradley Terry Model(Peters, n.d.), the extension formula is shown as below:

\[\mathrm{P}(\left[ i \mbox{ beats } j \right])=P(r_{ij}=1|\lambda_{i},\lambda_{j})=\frac{exp(\lambda_{i})}{exp(\lambda_{i})+exp(\lambda_{j})}=\frac{1}{1+e^{-(\lambda_{i}-\lambda_{j})}}=\sigma(\lambda_{i}-\lambda_{j}) \]

where\(r_{ij}=1\) indicates a win for player i against player j,\(\lambda_{i}\) and \(\lambda_{j}\) are the player skills which are the parameters of the model, and \(\sigma\) is the logistic sigmoid function: \(\frac{1}{1+e^{-x}}\).

For the some of data consisting of a set of match outcomes \(D={r^1,r^2...r^n}\) for a set of players \(M ={1,2,...,m}\) with ability \(\lambda={\lambda_{1},\lambda_{2},...,\lambda_{m}}\),the likelihood of the model given the data can be expressed as: \[P(D|\lambda)=\prod_{i = 1}^{N} \sigma(\lambda_{a}^i-\lambda_{b}^i)\]

In this project, the \(\lambda\) will be estimated as the prior probability with standard normal distribution (\(\lambda_{std} \sim N(0,1)\);\(\lambda \sim N(0,\lambda_{std}^2)\) and then added the observed match outcomes constantly to evaluate if the model strengthens or weakens the prior probability, so that the estimated posterior probability which will be closer to the actual value.

2. Mixed effect model in Stan (Surface random effect ):

Böckenholt (2001) decomposed the paired comparison model into fixed and a random effect components. The random effects component estimates the subject variation (given S subjects) in each item, while the fixed effect component such as \(\lambda_{i,s}\) and \(\lambda_{j,s}\) estimates the average log ability of the player. The random effects term is represented by \(U_{i,s}\), where i refers to the player being judged and s to the subject. The Bayesian Bradley-Terry model with random effects can be represented as: \[\mathrm{Pr}\left[ i \mbox{ beats } j \right]=\frac{exp(\lambda_{i,s})}{exp(\lambda_{i,s})+exp(\lambda_{j,s})};\] \[\lambda_{i,s}=\lambda_{i}+U_{i,s};\]

\[y_{ij}\sim Bernoulli(\mathrm{P}\left[ i \mbox{ beats } j \right]);\] \[ \lambda_{i} \sim N(0,\sigma_{\lambda}^2),[Prior];\] \[ U_{i,s} \sim N(0,U_{std}^2),[Prior];\] \[ U_{std} \sim N(0,\sigma_{U}^2),[Prior]\]

In this report, \(s\) represents different types of surface so \(U_{std}\) represents the standard deviation in the random effects and the difference between subjects. Therefore, the surface variable should be the random intercept as the grouped variables to evaluate the ability of each players within different surfaces. By doing this way, added the surface as a benchmark to represent the matches in which each pair players on different field. Moreover,the parameters \(\lambda_{i}\) can be used to rank the different players. However, in the Bayesian framework, a single measure is not obtained but rather a posterior distribution of the parameters \(\lambda_{i}\). By sampling from the posterior distribution of the log-abilities of the players, it is possible to create a posterior distribution of the ranks of the players, which helps to evaluate the uncertainty in the ranking system.

6 Impelementation

This section discusses the implementation of three different prediction models and is structured as follows:

  • Data re-preparation: briefly discusses the new model data about ranking data with the top 100 players.

  • Model process: mainly introduces the process of modeling in each prediction model.

  • Model output analysis: discusses the findings and results from the models.

6.1 Data re-preparation

For avoiding the technical issue such as the system crash, overload and time consuming, a subset data by selecting top 100 players for ATP and WTA to build up the model efficiently instead of using the full data set with ATP and WTA matches the record.

Therefore, atp_ranking and wta_ranking were collected from the official websites, atpworldtour.com and wtatennis.com respectively, with the rankings of the top 100 women and 100 men and they joined with the full dataset so the players were limited in the ranking of top 100 in both male and female for the final model data and it will be used for whole modeling parts.

6.2 Model Impletment

6.2.1 Standard Bradley-Terry Model

Modeling process:

The simple Bradley-Terry Model is implemented by the BradleyTerry2 package. There are two main functions used in this project:

  • BTm function: constructs the Bradley-Terry model with the matrix of p0_count and p1_count from the top 100 players data

  • BTabilities function: computes the ability of each player from the model objects against the baseline player.

Model output analysis:

  • Ability table interpretation:

Figure 6.1: The estimated ability of different tennis players in ATP

Figure 6.2: The estimated ability of different tennis players in WTA

From the table 6.1 and 6.2, Adrian Mannarino and Ajla Tomljanovic became the baseline against with other players on the ATP and WTA data as their ability are 0. In the ATP data, Novak Djokovic, Rafael Nadal and Roger Federer have the highest estimated ability compared with the Adrian Mannarino. All three of them have played and won most of matches and their ranking almost maintain at the top 10 of the world on average, indicating that their tennis level are higher than other players. On the contrary, Lorenzo Musetti and Sebastian Korda have the lowest estimated ability and one the reason for that they have only played one game and even lost in that match, so their ability is estimated quite lower.

Similarly, in the WTA model, Serena Williams have the highest estimated tennis skill, whereas Clara Tauson have the lowest estimated tennis ability compared with Ajla. That could be caused by the fact that Serena won the most of matches like more than 50 matches based on the data and Clara only had played two matches but she lost both of them.

6.2.2 Bayesian Paired Comparison models in Stan

6.2.2.1 Bradley-Terry model with head-to-head effect

Modeling process:

  • model_stan_data:construct the Stan model data with the total number of matches and the number of matches of player0 winning in each pair of players

  • bpc_h2h_pred.stan: is the model script file. Instead of only using simple Bradley Terry model with bernoulli distribution in a single match,the Stan model data collapse the matches between every pair of players so it had been became the binomial distribution, which is structured as: \(y \sim binomial(n,p)\). The main parameters are used in the Stan model:

    • \(n\): is the N trials which is the total number of matches in each pair player.

    • \(y\): is the number of success which is the number of times that player1 has won the match in this case.

    • \(\lambda\): is the latent variable, representing the ability of a player and had estimated from standard normal distribution.((\(\lambda_{std} \sim N(0,1)\);\(\lambda \sim N(0,\lambda_{std}^2)\)))

    • \(h2hz_{player_i,player_j}\): is a matrix which assigns the upper triangular values following standard normal distribution and lower triangular values following normal distribution but it has to be closer to zero which is the effective zero assignment.

    • \(h2h_{player_i,player_j}\): is a matrix which assigns the upper triangular values using draws around a shared normal distribution. The lower triangle is the same effects of opposite sign. So basically there is a shared effect for player i, j but it adds or subtracts depending on the order of player and which player is predicted the chance of a win for.

    • p1_win: is the probability of player1 win with considering the head-to-head effects between each pair of player. In addition, based on the Bradley-Terry Model, it has been used logistics function to evaluate the probability of player1 win on the logit scale which is mapped \((0,1) \to R\), so the inverse logistics function should be used to transform the log-odds value to real probability of player1 win which is mapped \(R \to (0,1)\) and the final structure for p1_win is shown as below:

\[Pr(win_1)=\frac{e^{(\lambda_{player1}-\lambda_{player0}+h2h_{player1,player0})}}{1+e^{(\lambda_{player1}-\lambda_{player0}+h2h_{player1,player0})}}\]

  • generated quantities: created a new vector for predictive value via binomial distribution

  • vb function: is used variational inference for finding optimal posterior distribution. In this case, using VB testing to examine the result, whether the model construction and diagnostics are reasonable approximation or not. It can also minimize the difference between actual distribution and approximation distribution and reduce the time consuming, which is a more effective and less uncertain method for checking the Stan model correctly.

  • sampling function: computes posterior samples via MCMC(Markov Chain Monte Carlo) method given a model previously compiled from stan_model (bpc_h2h_pred.stan). In this case, it is a more reliable way to find the posterior probability by final inferences.

Model output analysis:

1. MCMC Parameter diagnostics:

By default the sampling method to run 4 Markov chains of Hamiltonian Monte Carlo in parallel, each of those chains proceeds with 1000 warmup iterations and 1000 sampling iterations, totaling 4000 sampling iterations available for diagnostics and analysis.

To examine the model fitting, the MCMC parameter diagnostics should be built up after extracting summary statistics on each model parameter.

  • n_eff: the effective sample size
    • n_eff measures the effective sample size of that particular parameter which is the sum of the effectively independent sampling iterations across all chains(from a total 4000). The smaller n_eff, the greater the uncertainty associated with the corresponding parameter. If the ratio (n_eff/N) less than 0.001,indicating that the estimators are biased and significantly overestimate the true effective sample size(Gelman, Rubin, and others (1992)). Based on the figure 6.3,the distribution of ratio in two Stan models shows most of value are bigger than 0.5, suggesting that most of estimated parameter have better precision and there are no indications of problems in estimates of the effective sample size.
  • Rhat: the potential scale reduction statistic
    • Rhat statistic is roughly the ratio of the variance of a parameter when the data is pooled across all of the chains to the within-chain variance so it can helps tell us whether these parameters are so poorly sampled and measures the extent to which chains are reaching different conclusions. Vehtari et al. (2021) stated the Rhat should be in the range of 0.9 to 1.05 as the more idiosyncratically the chains behave if the Rhat value is far away from 1. Figure 6.4 displays the distribution of Rhat value in two Stan models and most of Rhat value are close to 1, implying that all chains have been converged to the same systematic behavior for a given parameter. So, the R hats look quite promising overall and suggest that the target posterior has been reached for the model parameters.
Evalute effective sample size in stan model

Figure 6.3: Evalute effective sample size in stan model

Evalute potential scale reduction statistic in stan model

Figure 6.4: Evalute potential scale reduction statistic in stan model

2.Head-to-head effect evaluation:

  • H2H effect parameter distributions:

Figure 6.5 represents the distribution of head-to-head effects are concentrated at 0 in two Stan models. The variation in h2h effects between the different players in the ATP model data was about 34%, which is higher than the variation with 17% in the WTA model data.

The distribution of head-to-head effects in stan model

Figure 6.5: The distribution of head-to-head effects in stan model

  • H2H effect in ATP data:

Figure 6.6: Head-to-head effect for each pair of players in ATP data

By filtering the negative mean value of h2h effects which corresponds to a player (player1) who is disadvantaged or has lower win chance against with their opponent (player0), the table 6.6 shows the head-to-head effects in different pairs and ranked players. The largest h2h effect is 0.26 between Dominic Thiem and David Goffin, where Dominic has a lower win chance in their matches. In fact, These two have faced each other 9 times in their career with David leading 7 to 2 based on the historical data. However, Dominic is ranked higher than David and Elo rating shows that the point difference between these two players was over 100 with Dominic ahead, which means implies that the favorite has a 64% chance of winning(Vaughan et al. 2021). It is surprisingly observed the result that David’s advantage over Dominic is lopsided even though considering the player’s ranking and Elo rating board.

The another similar situation happened between Stan Wawrinka and Marin Cilic, who exist the second largest matchup effects. In the last ten years, Wawrinka has won all the matches when they have faced each other 7 times overall. It is worth noting that Wawrinka’s ranking is higher than Cilic now, but Cilic was surpassed Warinka in the ATP tour ranking in 2016,2017 and the difference was significant especially in 2018.

Generally speaking, the higher ranked players trend to have better level of tennis skills or more talented whereas they are also likely to suffer failure when facing specific opponents,like Gilles Simon against Gael Monfils, who are more than 50 places apart in the rankings and the h2h effect with 0.25 shows Gilles has got a 25% boost on his odds when facing Monfils. So it is not surprising that Gilles won 6 out of 8 matches with Monfils.

On the other hand, there are also some interesting facts, for example, Novak Djokovic and Daniil Medvedev are the world’s No. 1 and No. 2 tennis players, but they have a greater chance to lose the match in the face of Nick Kyrgios and Lucas Pouille whose ranking is No.56 and No.85. Also, in the 100 biggest head-to-head effects, Lucas Pouille as the most frequently occurring in the beneficiary side of the player,having the benefit of the matchup class for 6 times, with the biggest edge being over Richard Gasquet. And Roberto Bautista Agut features 8 times in the top 100 overall, three of which were in the 10 biggest matchup effects. Roberto has disadvantaged with Fabio Fognini and Milos Raonic but had a better chance of winning when facing with Benoit Paire.

Federer and Rafael Nadal are widely regarded as the two greatest tennis players of all the time and the rivalry between them is considered as a modern-day tennis rivalry, also known as the famous Federer–Nadal rivalry (Bodo 2010). Even though the head-to-head effect between them gives 2% boost to Nadal which is not significant, it still can explain the some facts that Nadal won 9 of 16 matches over Federer and he had won 4 times over Federer in Clay in the 4 matches as well. Moreover, the Clay Elo rating of Nadal was higher than Federer in the last decay, implying that Nadal is more capable than Federer, especially when the match takes place in Clay, which may explain the matchup effects. Moreover, Nadal have occurred 5 times as the disadvantaged players in the 100 biggest head-to-head effects, which is the most.

  • H2H effect in WTA data:

Figure 6.7: Head-to-head effect for each pair of players in WTA data

Table 6.7 represents the head-to-head effects between each pair of player from WTA model and the bigger the number, the more favorable for player0, that is, the greater the chance of winning the game. The biggest one is the match between Kirsten Flipkens and Donna Vekic. They met five times between 2010 and 2020, and Flipkens won the match every time. Donna is higher than Flipkens in the player ranking and the Elo rating system, while Flipkens is more favorable when matching with Donna, with 10% improvement in her win odds.

It is interesting to see several players occur multiple times in the top 10 matchup effects. Sloane Stephens,Heather Watson and Caroline Garcia all appear twice among the top 10 biggest head-to-head effects, but only Caroline is favored in both sides. Moreover, Heather has the biggest advantage when facing with Stephens which won 4 times in the 6 matches and Misaki Doi is the next one who lost the matches all the time over Heather. Except for Heather, Stephens also has a lower chance to win the matches over Johanna Konta. Based on the historical match records, Konta had won all four of their previous matches, all played in the last years. Only once did Stephens win the first games, in the Wimbledon in 2019, taking the attack with slapping forehand winners cross-court, but Konta still won the match.

Looking across all of the top 100 head-to-head effect, there are several players who seem to be candidates for clashes of game style with their competitors. Caroline has a more potential opportunity to win over Victoria Azarenka and so it is the fact that Caroline always finds the way to win the matches when meeting with Azarenke. However, the situation is completely different if Caroline faced with Barbora Strycova. In the last decay, Barbora and Caroline fought each other five times, but Caroline lost all the time. Their ranking look similar and one of the reasons is to explain that the clashes of playing styles between them. Caroline is an offensive baseliner like Azarenka, with consistent and powerful ground strokes, whereas Barobora is a serve-and-volley player, with hitting the first and finishing volleys to end the point. The point construction is served, and it is difficult to return from which to hit a first volley so it may can explain why Caroline would have a disadvantage over Barbora.

Another pair of player came out in the 10 biggest matchup effect is probably surprising even eager tennis fans. This head-to-head effect about Elina Svitolina and Ashleigh Barty, where Elina has more advantages. Barty is the most popular player in the world and her ranking is the first one now as well. But they met four times before 2020 and Barty lost all the matches. On the other hand, it was memorable that Naomi beat Serena in the chaotic 2018 US Open final and reached her fourth major title match. Although the figure like the 5% improvement on odds is not significant from the table 6.7, it can be explained to a certain extent that Naomi has the benefit of the matchup effect with Serena.

6.2.2.2 Random effects in Bradley Terry model

Modeling process:

  • bpc function: is from the bpcs package which performs Bayesian estimation of Paired Comparison models utilizing Stan. There are three different surface types related to the matches so the cluster parameter should select in the surface column as the benchmarks for evaluating the ability of players in different surface. Once adding the random effect in the Bradley Terry model, the model type should become bt-U.

Model output analysis:

  1. Ability table with different surface in ATP data :

Figure 6.8: The most highest estimated ability of ATP players in different surface

From the table 6.8,it shows the most highest estimated ability of different players in the particular surface. It is not surprising to see that the estimated ability for Daniil Medvedev is the highest in the Hard court. As the matchup effect table has been shown before, Daniil Medvedev is in the fourth spot who has more potential chance to win over Stefanos Tsitsipas. Looking back the historical match records, Daniil won all four matches in the Hard, which can be reflected that Daniil has advanced tennis skills in the Hard court. In terms of the estimated players’ ability to play, Dominic Thiem is the highest in the Clay court. Although the head-to-head effect between Dominic and David Goffin is the highest in overall where Dominic would at a disadvantage, Dominic still won 2 matches in the Clay court. Nadal is in the third place in the Clay court and it is not even unexpected at all. One of the examples can be explained in that is Nadal-Federer rivalry which has been discussed above. In fact, a lot of people have thought Federer has a bad record against Nadal because he has a single-handed backhand and Nadal is a lefty with a lot of spin, but they don’t adjust for the fact of Nadal’s extreme ability on clay which could be seen in the table 6.8,and this also suggests that their matchup is overblown. For the grass court, the estimated ability for Marin Cilic is the highest. In fact, the grass court favors a serve and volley style of play and Marin is exactly this type of player who serving moves quickly towards the net after hitting a serve or produce an effective returning volley in quick movement.

  1. Ability table with different surface in WTA data :

Figure 6.9: The most highest estimated ability of WTA players in different surface

Anastasia has the most advanced tennis ability on the hard court based on the table 6.9. Since Anastasia has played 125 matches, almost the most on hard court, and won more than half, Anastasia’s ability should be higher. It is worth noting that Naomi and Sabalenka are in the fifth and eighth spot as well, both of them are the top 10 players in the world. In terms of the estimated players’ ability to play, Tamara Zidansek is the highest in the clay court. Although Tamara claimed the first title on the pro-level in 2014, she did not have many match records in the model data, but it is not hard to evaluate for that Tamara met with other 14 players in the 19 matches and won 15 times between them, indicating that Tamara has an advantage in the clay court. Similarly, Kristina Mladenovic is in the second spot of the top 10 players who has the most advanced skills in the clay court. Kristina won 25 out of 32 matches base on the past records, showing her better performance when playing on the clay court. It is interesting to see the estimated ability on the grass court for Alison Riske is greater than 1, and is also the highest among all players. In fact, Alison played 30 matched in total and won 22 times, so the probability of Alinson win is around 73%, the highest of all players. Barbora Strycova is in the second place in the list, winning 17 of 27 matches with a 63% chance. Furthermore, although only a few matches occurred in the carpet court, Martincova has the highest expected ability, won all the matches over other players.

7 Model Evaluation

After modeling the data, the predicted probability with each match will generate from different models. In addition, this project focuses on making probabilistic predictions on the overall outcome of each match. The predictions therefore take the form of a single value \(P_{w}\), which is the probability assigned by a given model to the winning player. Since there are no draws and the probability of the losing player can be inferred as \(P_{l}=1-P_{w}\), then a single value suffices for all predictions. For evaluating the model performances such as the model predictions are close to the actual value or not, one of key method is the calibration.

Calibration is a measure of the model performance, which can be defined as how well the forecasted probabilities correspond to the actual outcomes and can be useful in order to better understand the behavior and biases of different models (Kovalchik 2016). Basically, the probabilistic prediction can be well calibrated if the observed winning probability is close to \(\alpha\) when considering all matches with predicted probabilities \(\alpha\). In order to identify if this condition is satisfied, the calibration curve is established. Firstly, data will be grouped into 10 bins ordered by their predicted probability, 0-10%, 10-20%, etc. and the mean predicted probability will be calculated in each bin. And then the mean predicted probability will be compared with the mean the actual probability in each bin which is the sum of the matches of players win divided by the sum of the number of matches between each pair players played.

When a model is well-calibrated, the mean actual probability against with a mean predicted probability will close with each other. If the mean actual probability is greater than the mean predicted probability, the models trend to be underestimating the wins of players, the reverse is an overestimate.

7.1 Standard Bradley-Terry Model

In the Bradley-Terry model, using the predict function to predict the probability of player 0 win in the competition between each pair of players, and then grouping the expected probability into 10 bins to calculate the mean of expected probability in each bin. Table 7.1 shows the comparison between the mean actual probability and the mean predictive probability in each bin in the ATP and WTA. The larger the positive difference is, the greater the underestimation of the winning rate is. At the same time, the larger the negative difference is, the more overestimated the winning rate is. In order to directly visualize the difference between the expected value against with actuality, the scatter plot 7.1 is constructed as the calibration curve to evaluate the model performance.

Therefore, from the figure 7.1, the estimated winning probability of player 0 is close to the actual value, especially in the range of 0.5 to 0.75 in ATP data. There is a little bit of volatility outside the range. Looking back at the table 7.1 , it is obvious that the difference between the expected value against with reality is smaller and positive except in the bins 2,3,5 and 9. The biggest differences is in the bins 2 and it can be visualized in the plot as well where has the most fluctuating in the 12%-15%.

A similar way to access the model performance in WTA data. The closest to real probability is in the fourth to ninth bins where all differences are almost after second decimal places. It is interesting to see there is a turning point after 84%, which is the biggest difference as well compared with others, is 0,0786 in the tenth bin. The reason for that is the mean actual probability is 1 which all the players 0 won all the matches in the tenth bin, but expected probabilities for them are around 0.92, it may cause the winning probabilities of player 0 will be underestimated in the tenth bin.

Although there are differences in each box, the difference is small and can be ignored. Moreover, most of the predicted values in the scatter plot 7.1 fit well with the actual probability. Overall, the model prediction is close to the actual probability of player 0 wins.

Table 7.1: Calibration table for standard Bradley-Terry model
bins atp_mean_act atp_mean_pred wta_mean_act wta_mean_pred atp_diff wta_diff
1 0.0686 0.0659 0.0357 0.0662 0.0027 -0.0305
2 0.1250 0.1596 0.1444 0.1551 -0.0346 -0.0107
3 0.2742 0.2520 0.2963 0.2523 0.0222 0.0440
4 0.3642 0.3518 0.3548 0.3506 0.0125 0.0043
5 0.4315 0.4484 0.4373 0.4490 -0.0168 -0.0117
6 0.5510 0.5483 0.5514 0.5468 0.0027 0.0046
7 0.6562 0.6475 0.6468 0.6451 0.0087 0.0017
8 0.7538 0.7485 0.7500 0.7423 0.0053 0.0077
9 0.8153 0.8420 0.8425 0.8360 -0.0266 0.0065
10 0.9362 0.9384 1.0000 0.9232 -0.0022 0.0768
Calibration curve in Bradley Terry Model

Figure 7.1: Calibration curve in Bradley Terry Model

7.2 Bayesian paired comparison model

7.2.1 Head-to-head effect

Since the Stan model used the binomial distribution as the predictive model, the predicted value of the head-to-head effect in the Bayesian paired comparison model is no longer be probability,but the number of matches player 1 wins. And, the predicted value should be an integer because of the definition of binomial distribution, so it is reasonable to choose the prediction with 50% as the final prediction of the match that player 1 will win.

It is clear that there are two segments of the WTA data with the greatest volatility in the figure 7.2, one trend to be concave, the other one is convex. The fluctuation in ATP data seems to be lesser than WTA data and only the first and last bins deviated from the 45 degree line, and the rest seemed to fit well.

Rather than only visualizing the relationship between expected probability and actual probability from the figure, combine with the table 7.2 to analyze why such a situation appears. Based on the table 7.2, the most interesting thing for WTA data is in the last two bins where the mean actual probability is 1 in the ninth bin and the mean predicted probability is 1 in the tenth bin, causing the convex issue. In fact, the plot 7.4 which can be interactively visualized the distribution of number of matches with different times of meeting in each bin in WTA data and it is clear that in the ninth bin, only one pair player like Victoria Azarenka with Alize Cornet appears and Azarenka won all matches in 6 times but the predicted matches of Azarenka win are 5, which causes the difference between mean of actual probability and expected value. However, the frequency of most of the matches in the tenth bin is concentrated in 1 and there are only 54 matches that players have met twice, suggesting that the matchup effects among them who only have played one time are unstable. A similar situation for ATP data can be seen in the plot 7.3, there are more than 400 matches that players have played one time and only 27 matches that players have played more than three time, indicating that the size of players met one time is larger than players met more than three times so that the matchup effects between them are abnormal or variable.

Another point of concern is that the mean predicted probability are zero in both first bins. One of the reasons is that the sample size of the matchup effects is extremely small. It can be seen from the figure 7.3 and 7.4, there are more than 400 matches that players have only met one time with their opponent in the ATP and WTA, indicating that head-to-head effects between them are less than other which causes all the predicted matches of player 1 win are zero in this case.

In general, the model performance with considering head-to-head effect is better in terms of ATP data rather than WTA data, but both of them can be accepted in this project even though they are not perfectly fitted to the data due to the extreme predicted value.

Table 7.2: Calibration table for evaluating head-to-head effect
bins atp_mean_pred atp_mean_act wta_mean_pred wta_mean_act atp_diff wta_diff
1 0.0000 0.2549 0.0000 0.3000 0.2549 0.3000
2 0.1820 0.1695 0.1871 0.1579 -0.0125 -0.0292
3 0.2539 0.2402 0.2548 0.2500 -0.0137 -0.0048
4 0.3483 0.3658 0.3412 0.3743 0.0175 0.0331
5 0.4982 0.4889 0.4990 0.4849 -0.0093 -0.0141
6 0.5947 0.6318 0.5949 0.6347 0.0371 0.0399
7 0.6646 0.6488 0.6658 0.6115 -0.0158 -0.0543
8 0.7568 0.7477 0.7558 0.7819 -0.0091 0.0261
9 0.8530 0.8867 0.8333 1.0000 0.0337 0.1667
10 1.0000 0.7837 1.0000 0.7044 -0.2163 -0.2956
Calibration curve in head-to-head effect Stan model

Figure 7.2: Calibration curve in head-to-head effect Stan model

Figure 7.3: The distribution of the number of matches that ATP players meet in the number of times

Figure 7.4: The distribution of the number of matches that WTA players meet in the number of times

7.2.2 Random effect in bayesian paired comparison model

For finding the expected probability in each match, using the predict function for bpc objects which are derived from the bpc function with surface random effects. It will generate the 100 iteration by default and the get the mean probability of each column, which will be the final probability for each match. And then the mean probability of each pair player has to be calculated through sum of expected probability in one match divided by the sum of the matches they played. The final data will be combined with the predicted probability against with the actual number of matches of player 0.

Since only a few matches happened in the carpet court which may raise more uncertainty for evaluating this surface type in the model, it can give more explanation by combining carpet and grass court together. So, the figure 7.7 can be visualized directly how well the model fit in the data which has been considered the surface random effect. Based on the scatter plot, the performance of the mixed effect models in the data is relatively poor especially in WTA data. The volatility between prediction and actual probability in WTA is larger than in ATP and all the predicted values are inconsistent with the reality or even close to the actual value in both plots.

Further explanation needs to be explored from the table 7.5 and 7.6 which display the mean of expected probability with different types of surface against the actual probability in each bin in ATP data and WTA data. In the table 7.5, the mean of expected probability increase gradually in all surfaces, but the actual value seems to be concentrated in 0.4 to 0.5 in the third to eighth bin. One of the interesting things is the difference between expected probability and actuality are smallest in the fifth bin in three surfaces. Before the fifth bin, the mixed effect model underestimated the probabilities of player 0 wins, but after that, the probabilities of player 0 wins are overestimated.

On the other hand, in the table 7.6,the differences between expected probability and real value in WTA are completely different from the ATP table as the probabilities of player 0 wins in all surfaces are overestimated from the mixed effect model before the fifth bin and are underestimated after the fifth bin. It is worth noting that the expected probability of player 0 wins in the third bin is the most underestimated, with an average probability of about 65%, while the average predicted value is only 26%. That is the peak in the grass court that is shown in the right side of scatter plot 7.7.

Obviously, the performances of the mixed effect model in both data sets are not good and the accuracy of model fitting is very low, therefore the mixed effect model about considering the surface random effect is not the appropriate method to predict the tennis matches based on the calibration results and it may take more debugging to improve the model fitting.

Figure 7.5: Calibration table for evaluating the random effect in ATP data

Figure 7.6: Calibration table for evaluating the random effect in WTA data

Calibration curve for surface random effect in bayesian paired comparison model

Figure 7.7: Calibration curve for surface random effect in bayesian paired comparison model

8 Conclusion

This project seeks to develop and compare the different approaches for predicting the tennis match outcomes. The main method to construct the model structure is the Bradley-Terry model which is one of the most effective ways to apply the paired comparison model (McHale and Morton 2011). Based on this, the head-to-head effect would be considered as the explanatory factor in the model so that the Bayesian paired comparison model was introduced by capturing these matchup effects. In addition, not only considering the head-to-head effect, but also exploring the ability of each player in different types of surface. The mixed effect model with surface random effects was implemented in this project for achieving the goal. After modeling the data, the calibration would be evaluated the model fitting and model performance in three models.

For the standard Bradley-Terry model, the model fitting was good and it trained the data to evaluate the ability of each player. The model performance in the ATP and WTA data was well and the predicted probabilities were closer to the actual value. However, the standard Bradley-Terry model only implemented based on the match level information such as the number of matches with player 0 wins against with player 0 loses, which was not enough to construct the prediction model for tennis matches(Jackson and Mosurski 1997). The result of the player - opponent match sometimes is not only about the different capability between two players, it also about the playing style or abilities in different surface courts.

The findings of the head-to-head model can be explained some abnormal situation like Wawrinka over Cilic, who lost all the matches in 7 times and their matchup effect was in the second place compared with other. Also,fitting the models with a Bayesian approach results in performance, which is marginally better than the standard Bradley-Terry model even though the volatility in the WTA was larger. Additionally, for the ATP data, this model gives a slight improvement in performance compared to a standard point level Bradley-Terry model, which was closer to the actual probability after considering the head-to-head effect in the matches.

Finally, the mixed effect model was performed in both data sets, which considered the different surface courts in the match records that evaluated the ability of each player based on the surface courts. The performance of this model was relatively poor in both data sets especially of WTA data. Also, the model fitting of these two data was very low and neither of them has achieved the expected effect. This may need more further explore debugging, such as adjusting the prior parameter setting. However,this model showed that the expected ability of each player in different surface courts was reasonable compared with the reality.

8.1 Further work

8.1.1 Expanding the Data Set

In this project, the full data set with ATP and WTA historical match records should be theoretically used to construct the model and evaluate the model performance. However, because of the large amount of the full data, the system runs slowly to finish the modeling process, especially in the head-to-head model, it even took more than 2 days to run without any results. Therefore, by filtering the top 100 players to develop the model rather than the full data, which may cause the problems like incomplete information as there was only a small amount of higher ranked player (Top 100 players) data available. One of the improvements from this project is to use the complete data, but it may need to work out how to improve the system efficiency during the modeling.

8.1.2 Combing the Surface Factor and Head-to-Head Models

In the mixed effect model, it has been demonstrated that the variation of player skills between different surface courts could be explained despite the accuracy of model prediction was very low. Future work could implement the surface random effect factors into the head-to-head effect model, which would provide more possibility and information for exploration of the head-to-head effect.

8.1.3 Additional Factors

The other improvement of this project is to take player ranking and Elo score as exploratory factors in the model, which can improve the accuracy and practicability of tennis prediction. Additionally,the Elo-based prediction system created by the website FiveThirtyEight.com(Morris, Bialik, and Boice 2016), which would be considered as the further model development well(Vaughan et al. 2021).

9 Acknowlegments

The R packages are used in this report: tidyverse(Wickham et al. 2019); dplyr(Wickham et al. 2020); ggplot2(Wickham 2016); here(Müller 2017); kableExtra(Zhu 2019); ggwordcloud(Le Pennec and Slowikowski 2019); scales(Wickham and Seidel 2020); formattable(Ren and Russell 2016); bpcs(Issa Mattos and Martins Silva Ramos 2020); BradleyTerry2(Turner and Firth 2012); rvest(Wickham 2021); rstan(Stan Development Team 2020); bayesplot(Gabry et al. 2019); ggpubr(Kassambara 2020); DT(Xie, Cheng, and Tan 2020);plotly(Sievert 2020)

Reference

Bodo, Peter. 2010. The Clay Ran Red: Roger Federer Vs. Rafael Nadal at Roland Garros. Diversion Books.

Böckenholt, Ulf. 2001. “Hierarchical Modeling of Paired Comparison Data.” Psychological Methods 6 (1): 49.

Bradley, Ralph Allan, and Milton E Terry. 1952. “Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.” Biometrika 39 (3/4): 324–45.

Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” J. R. Stat. Soc. A 182 (2): 389–402. https://doi.org/10.1111/rssa.12378.

Gelman, Andrew, Donald B Rubin, and others. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.

Issa Mattos, David, and Erika Martins Silva Ramos. 2020. “bpc: A Package for Bayesian Paired Comparison Analysis.”

Jackson, David, and Krzysztof Mosurski. 1997. “Heavy Defeats in Tennis: Psychological Momentum or Random Effect?” Chance 10 (2): 27–34.

Kassambara, Alboukadel. 2020. Ggpubr: ’Ggplot2’ Based Publication Ready Plots. https://CRAN.R-project.org/package=ggpubr.

Klaassen, Franc JGM, and Jan R Magnus. 2001. “Are Points in Tennis Independent and Identically Distributed? Evidence from a Dynamic Binary Panel Data Model.” Journal of the American Statistical Association 96 (454): 500–509.

Kovalchik, Stephanie Ann. 2016. “Searching for the Goat of Tennis Win Prediction.” Journal of Quantitative Analysis in Sports 12 (3): 127–38.

Le Pennec, Erwan, and Kamil Slowikowski. 2019. Ggwordcloud: A Word Cloud Geom for ’Ggplot2’. https://CRAN.R-project.org/package=ggwordcloud.

McHale, Ian, and Alex Morton. 2011. “A Bradley-Terry Type Model for Forecasting Tennis Match Results.” International Journal of Forecasting 27 (2): 619–30.

Morris, B, C Bialik, and J Boice. 2016. “How We’re Forecasting the 2016 Us Open.” Also Available at Https://Fivethirtyeight. Com/Features/How-Were-Forecasting-the-2016-Us-Open.

Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

Peters, Joss. n.d. “Predicting the Outcomes of Professional Tennis Matches.”

Ren, Kun, and Kenton Russell. 2016. Formattable: Create ’Formattable’ Data Structures. https://CRAN.R-project.org/package=formattable.

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.

Stan Development Team. 2020. “RStan: The R Interface to Stan.” http://mc-stan.org/.

Turner, Heather, and David Firth. 2012. “Bradley-Terry Models in R: The BradleyTerry2 Package.” Journal of Statistical Software 48 (9): 1–21. https://www.jstatsoft.org/v48/i09/.

Turner, Heather, David Firth, and others. 2012. “Bradley-Terry Models in R: The Bradleyterry2 Package.” Journal of Statistical Software 48 (9).

Vaughan, Williams Leighton, Chunping Liu, Lerato Dixon, Hannah Gerrard, and others. 2021. “How Well Do Elo-Based Ratings Predict Professional Tennis Matches?” Journal of Quantitative Analysis in Sports 17 (2): 91–105.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2021. “Rank-Normalization, Folding, and Localization: An Improved R for Assessing Convergence of Mcmc.” Bayesian Analysis 1 (1): 1–28.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2021. Rvest: Easily Harvest (Scrape) Web Pages. https://CRAN.R-project.org/package=rvest.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Dana Seidel. 2020. Scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2020. DT: A Wrapper of the Javascript Library ’Datatables’. https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.